This document describes the percentile as a robust measure of central tendency and spread within a distribution of values. Examples are given for its application in the context of soil data summaries.
Within a set of data, the n-th percentile describes the value below which n% of the data, when sorted, fall. For example, within the integer sequence spanning 0 to 100, 50 is the 50th percentile or median, 10 is the 10th percentile, and 90 is the 90th percentile.
Consider the following (hypothetical) field-described clay contents from A horizons of the same taxon:
11, 10, 12, 23, 17, 16, 17, 14, 24, 22, 14
sorted:
10, 11, 12, 14, 14, 16, 17, 17, 22, 23, 24
resulting in the 10th (11), 50th (16), and 90th (23) percentiles.
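The calculation above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the helper `percentile()` is a hypothetical name, and it assumes linear interpolation between ranks, which is one of several common rules for estimating percentiles.

```python
def percentile(values, p):
    """Estimate the p-th percentile (0-100) by linear interpolation between ranks.

    Assumption: this is one common interpolation rule, chosen for illustration.
    """
    x = sorted(values)
    if not x:
        raise ValueError("at least one value is required")
    # fractional, 0-based position of the requested percentile in the sorted data
    h = (len(x) - 1) * p / 100
    lo = int(h)                    # rank at or just below that position
    hi = min(lo + 1, len(x) - 1)   # next rank, capped at the largest value
    return x[lo] + (h - lo) * (x[hi] - x[lo])


# hypothetical field-described clay content, as listed above
clay = [11, 10, 12, 23, 17, 16, 17, 14, 24, 22, 14]

p10, p50, p90 = (percentile(clay, p) for p in (10, 50, 90))
print(p10, p50, p90)  # 11.0 16.0 23.0
```

With 11 values, the 10th, 50th, and 90th percentiles land exactly on the 2nd, 6th, and 10th sorted values, so no interpolation is needed in this case.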
Percentiles require no distributional assumptions and are bound to the data from which they are computed. This means that percentiles can provide meaningful benchmarks for both normal and non-normal distributions, and that the limits will always fall within the minimum and maximum of the observed data.
Percentiles have a direct interpretation. Consider the 10th (P10) and 90th (P90) percentiles: "given the available data, we know that soil property p < P10 10% of the time, and p < P90 90% of the time". The same statement can be framed using probabilities or proportions: "given the available data, soil property p is within the range of P10 to P90 80% of the time".
Percentiles are simple to calculate, requiring at least 3, preferably 10, and ideally more than 20 observations.
The median is a robust estimator of central tendency.
The lower and upper percentiles (e.g. 10th and 90th) are robust estimators of spread.
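The robustness claimed above can be illustrated with the hypothetical clay data from earlier. In this sketch a single value is deliberately corrupted (24 recorded as 240, an invented data-entry error) and the mean and standard deviation are compared with the percentiles. The helper `percentile()` is a hypothetical name and assumes linear interpolation between ranks.

```python
import statistics


def percentile(values, p):
    """Estimate the p-th percentile (0-100) by linear interpolation between ranks."""
    x = sorted(values)
    h = (len(x) - 1) * p / 100     # fractional, 0-based position
    lo = int(h)
    hi = min(lo + 1, len(x) - 1)
    return x[lo] + (h - lo) * (x[hi] - x[lo])


def summarize(values):
    """Return mean, standard deviation, and P10 / P50 / P90 for a list of values."""
    return {
        "mean": statistics.mean(values),
        "sd": statistics.stdev(values),
        "P10": percentile(values, 10),
        "P50": percentile(values, 50),
        "P90": percentile(values, 90),
    }


clay = [11, 10, 12, 23, 17, 16, 17, 14, 24, 22, 14]
# hypothetical data-entry error: the largest value, 24, recorded as 240
clay_err = [240 if v == 24 else v for v in clay]

original = summarize(clay)
corrupted = summarize(clay_err)

for label, s in (("original", original), ("with outlier", corrupted)):
    print(label, {k: round(v, 1) for k, v in s.items()})
```

The single bad value moves the mean from roughly 16.4 to 36.0 and inflates the standard deviation, while P10, P50, and P90 are unchanged because they depend only on the ranks of the interior values.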
Estimation of percentiles is based on ranking of the original data. Interpolation between observed values is required when there is a small number of samples. Consider the values (1,3,5,6,7,9,9,10). Estimation of the 10th, 50th, and 90th percentiles results in 2, 7, 10 respectively. Since we are not typically interested in the estimated percentiles verbatim, the interpolated estimates are close enough. The Harrell-Davis quantile estimator is a robust method for estimating quantiles, regardless of sample size.
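To make the interpolation step concrete, the following sketch estimates percentiles for the eight values above. It assumes linear interpolation between adjacent ranks, implemented in a hypothetical helper `percentile()`; this is not the Harrell-Davis estimator. The exact estimates depend on the interpolation rule and on rounding, so they can differ slightly from the values quoted in the text.

```python
def percentile(values, p):
    """Estimate the p-th percentile (0-100) by linear interpolation between ranks."""
    x = sorted(values)
    # fractional, 0-based position; a non-integer position falls between two observations
    h = (len(x) - 1) * p / 100
    lo = int(h)
    hi = min(lo + 1, len(x) - 1)
    # weight the two neighbouring observations by the fractional part of the position
    return x[lo] + (h - lo) * (x[hi] - x[lo])


values = [1, 3, 5, 6, 7, 9, 9, 10]

estimates = {p: percentile(values, p) for p in (10, 50, 90)}
for p, est in estimates.items():
    print(f"P{p}: {est:.1f}")
```

Under this rule the estimates are approximately 2.4, 6.5, and 9.3. None of them is an observed value: each falls between two neighbouring observations, which is what interpolation means for a small sample.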
The following figures demonstrate the relationship between distribution shape, measures of central tendency (mean and 50th percentile), and measures of spread (mean +/- 2 standard deviations, and 10th / 90th percentiles). Within each figure is an idealized normal distribution that is based on the sample mean and standard deviation. The y-axis can be interpreted as the "relative proportion" of samples associated with a value on the x-axis. The thick, smooth lines represent an estimate of density, a continuous alternative to the histogram (grey columns).
With a large enough sample size, the distribution of some soil properties can be approximated with the normal or Gaussian distribution. In this case, the mean and median are practically equal and the spread around the central tendency is balanced. Examples include lab-measured clay content and pH from a suite of related samples (e.g. A horizons from a single soil series concept).
| mean | sd | skew | min | P5 | P10 | P50 | P90 | P95 | max | n |
|---|---|---|---|---|---|---|---|---|---|---|
| 15.3 | 2 | -0.2 | 10.4 | 11.8 | 12.9 | 15.3 | 17.9 | 18.4 | 19.5 | 100 |
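As a rough check on the normal approximation, the percentiles implied by an idealized normal distribution can be computed from the mean and standard deviation in the table above and compared with the sample percentiles. This is a sketch using Python's standard-library `statistics.NormalDist`; the variable names are illustrative.

```python
from statistics import NormalDist

# mean and standard deviation copied from the summary table above
fitted = NormalDist(mu=15.3, sigma=2)

# percentiles implied by the idealized normal distribution
p10 = fitted.inv_cdf(0.10)
p50 = fitted.inv_cdf(0.50)
p90 = fitted.inv_cdf(0.90)

print(f"normal approximation: P10={p10:.1f} P50={p50:.1f} P90={p90:.1f}")
print("sample percentiles:   P10=12.9 P50=15.3 P90=17.9")
```

The implied P10 and P90 (about 12.7 and 17.9) sit symmetrically around the mean, at roughly 1.28 standard deviations on either side, and are close to the sample percentiles reported in the table.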
Various forms of the log-normal distribution are typically more accurate approximations of soil properties. Log-normal distributions with a "short tail", or a low degree of asymmetry around the central tendency (skewness), are common. Note the shift between mean and median, and the unequal distances to the 10th and 90th percentiles. Examples include lab-measured organic carbon and field-measured rock fragment volume.
| mean | sd | skew | min | P5 | P10 | P50 | P90 | P95 | max | n |
|---|---|---|---|---|---|---|---|---|---|---|
| 17.2 | 6.7 | 0.7 | 5.9 | 7.9 | 9.9 | 15.9 | 27.1 | 30 | 37.4 | 100 |
Log-normal distributions with a "long tail", i.e. more skewed, are commonly encountered when summarizing GIS data sources such as elevation, slope, and curvature. Note that the mean +/- 2SD is no longer a meaningful representation of spread around the central tendency.
| mean | sd | skew | min | P5 | P10 | P50 | P90 | P95 | max | n |
|---|---|---|---|---|---|---|---|---|---|---|
| 38.8 | 51.8 | 2.6 | 0.7 | 2 | 3.9 | 18.6 | 106.4 | 148.6 | 304.1 | 100 |
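The arithmetic behind that note can be worked through directly from the summary statistics in the table above. The sketch below only restates numbers from the table; the variable names are illustrative.

```python
# summary statistics copied from the long-tailed example table above
mean, sd = 38.8, 51.8
minimum, p10, p50, p90, maximum = 0.7, 3.9, 18.6, 106.4, 304.1

# interval implied by the mean and standard deviation
lower = mean - 2 * sd
upper = mean + 2 * sd

print(f"mean +/- 2SD: {lower:.1f} to {upper:.1f}")
print(f"P10 to P90:   {p10} to {p90}")
print("lower limit falls below the observed minimum:", lower < minimum)
```

The lower limit of mean +/- 2SD is about -64.8, far below the observed minimum of 0.7, whereas P10 and P90 stay within the range of the observed data.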
In general, the further the departure from a normal distribution, the less meaningful mean and standard deviation are as metrics of central tendency and spread.
| mean | sd | skew | min | P5 | P10 | P50 | P90 | P95 | max | n |
|---|---|---|---|---|---|---|---|---|---|---|
| 29.4 | 5.2 | 0.1 | 17 | 21.3 | 23.1 | 29.4 | 36.7 | 38.6 | 41.8 | 64 |
| mean | sd | skew | min | P5 | P10 | P50 | P90 | P95 | max | n |
|---|---|---|---|---|---|---|---|---|---|---|
| 0.048 | 0.03 | 1.11 | 0.015 | 0.017 | 0.021 | 0.038 | 0.101 | 0.111 | 0.117 | 17 |
| mean | sd | skew | min | P5 | P10 | P50 | P90 | P95 | max | n |
|---|---|---|---|---|---|---|---|---|---|---|
| 6.1 | 0.9 | 0.4 | 4.6 | 4.8 | 4.9 | 6 | 7.4 | 7.8 | 8.3 | 64 |
Many sampled values lead to more reliable estimates of central tendency and spread.
| mean | sd | skew | min | P5 | P10 | P50 | P90 | P95 | max | n |
|---|---|---|---|---|---|---|---|---|---|---|
| 113.8 | 57.1 | 2.8 | 52.4 | 64.8 | 68.7 | 98.2 | 184.1 | 212.2 | 507.3 | 17014 |
| mean | sd | skew | min | P5 | P10 | P50 | P90 | P95 | max | n |
|---|---|---|---|---|---|---|---|---|---|---|
| 6.4 | 5.8 | 2.4 | 0 | 0.9 | 1.4 | 4.7 | 13.2 | 17.4 | 60.2 | 17014 |
| mean | sd | skew | min | P5 | P10 | P50 | P90 | P95 | max | n |
|---|---|---|---|---|---|---|---|---|---|---|
| 508.7 | 46.2 | 3.5 | 450 | 464.8 | 469.9 | 498.2 | 548 | 567.9 | 807 | 17014 |
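The effect of sample size on the reliability of percentile estimates can be sketched with a small simulation. Everything here is an assumption made for illustration: the log-normal parameters, the number of replicates, the seed, and the hypothetical helpers `percentile()` and `p90_spread()`. None of it is derived from the data summarized above.

```python
import random
import statistics


def percentile(values, p):
    """Estimate the p-th percentile (0-100) by linear interpolation between ranks."""
    x = sorted(values)
    h = (len(x) - 1) * p / 100     # fractional, 0-based position
    lo = int(h)
    hi = min(lo + 1, len(x) - 1)
    return x[lo] + (h - lo) * (x[hi] - x[lo])


def p90_spread(n, replicates=200, seed=1):
    """Standard deviation of the estimated P90 across repeated samples of size n."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(replicates):
        # simulated, right-skewed values (arbitrary log-normal parameters)
        sample = [rng.lognormvariate(3, 1) for _ in range(n)]
        estimates.append(percentile(sample, 90))
    return statistics.stdev(estimates)


# sample sizes echo the guidance given earlier: at least 3, better 10, ideally > 20
spread = {n: p90_spread(n) for n in (3, 10, 20, 1000)}
for n, s in spread.items():
    print(f"n = {n:>4}: spread of estimated P90 = {s:.1f}")
```

In this simulation the estimated P90 varies far less from sample to sample at n = 1000 than at n = 10 or n = 20, which is the sense in which more observations give more reliable percentiles.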